Phasing of many thousands of genotyped samples.

Authors

  • Amy L Williams
  • Nick Patterson
  • Joseph Glessner
  • Hakon Hakonarson
  • David Reich
Abstract

Haplotypes are an important resource for a large number of applications in human genetics, but computationally inferred haplotypes are subject to switch errors that decrease their utility. The accuracy of computationally inferred haplotypes increases with sample size, and although ever larger genotypic data sets are being generated, the fact that existing methods require substantial computational resources limits their applicability to data sets containing tens or hundreds of thousands of samples. Here, we present HAPI-UR (haplotype inference for unrelated samples), an algorithm that is designed to handle unrelated and/or trio and duo family data, that has accuracy comparable to or greater than existing methods, and that is computationally efficient and can be applied to 100,000 samples or more. We use HAPI-UR to phase a data set with 58,207 samples and show that it achieves practical runtime and that switch errors decrease with sample size even with the use of samples from multiple ethnicities. Using a data set with 16,353 samples, we compare HAPI-UR to Beagle, MaCH, IMPUTE2, and SHAPEIT and show that HAPI-UR runs 18× faster than all methods and has a lower switch-error rate than do other methods except for Beagle; with the use of consensus phasing, running HAPI-UR three times gives a slightly lower switch-error rate than Beagle does and is more than six times faster. We demonstrate results similar to those from Beagle on another data set with a higher marker density. Lastly, we show that HAPI-UR has better runtime scaling properties than does Beagle so that for larger data sets, HAPI-UR will be practical and will have an even larger runtime advantage. HAPI-UR is available online (see Web Resources).
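The abstract's two central quantities, the switch-error rate and consensus phasing across repeated runs, can be illustrated with a small sketch. This is not HAPI-UR's implementation: the function names, the 0/1 encoding of alleles at heterozygous sites, and the majority vote over adjacent heterozygous pairs are all assumptions chosen for illustration, and the toy data are invented.

```python
from typing import List, Sequence


def switch_error_rate(truth: Sequence[int], inferred: Sequence[int]) -> float:
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong.

    Each sequence lists the allele (0 or 1) carried by one haplotype at the
    individual's heterozygous sites. Hypothetical encoding, for illustration.
    """
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # A switch occurs wherever agreement with the truth changes between neighbours.
    mismatch = [t != i for t, i in zip(truth, inferred)]
    switches = sum(mismatch[k] != mismatch[k - 1] for k in range(1, len(mismatch)))
    return switches / (len(truth) - 1)


def consensus_phase(runs: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote, per adjacent site pair, on whether the allele flips.

    Assumed voting scheme; use an odd number of runs to avoid ties.
    """
    n_sites = len(runs[0])
    consensus = [0]  # the orientation of the first site is arbitrary
    for k in range(1, n_sites):
        flip_votes = sum(run[k] != run[k - 1] for run in runs)
        flip = 1 if 2 * flip_votes > len(runs) else 0
        consensus.append(consensus[-1] ^ flip)
    return consensus


# Toy data (invented): three runs, each with one switch error at a different place.
truth = [0, 1, 1, 0, 0, 1]
run_a = [0, 1, 0, 1, 1, 0]
run_b = [0, 1, 1, 0, 1, 0]
run_c = [0, 0, 0, 1, 1, 0]
combined = consensus_phase([run_a, run_b, run_c])
```

In this toy case each run disagrees with the truth at a different adjacent pair, so the vote outnumbers every individual error. Real runs can share errors, in which case a vote of this kind would not remove them.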

Similar resources

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method a...

The Clark Phase-able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS

A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years, it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use ...

Recursive Long Range Phasing and Long Haplotype Library Imputation: Building a Global Haplotype Library for Holstein cattle

Long range phasing (LRP) is a fast and accurate rule based method which uses information from both related and unrelated individuals by invoking the concepts of surrogate parents and Erdös numbers (Kong et al., 2008). Recursive long range phasing and long haplotype imputation (RLRPLHI; Hickey et al., 2009) is an extended LRP algorithm with increased robustness partially due to the extra long ha...

Holographic correction and phasing of large sparse-array telescopes.

I have constructed a 1-m-diameter telescope using separate, low-quality spherical primary mirror segments. A single hologram of the mirrors is used to correct the random surface distortions as well as spherical aberration, while simultaneously phasing the individual apertures together. I present experimental results of the removal of an error of thousands of waves to produce a diffraction-limit...

Phasing for medical sequencing using rare variants and large haplotype reference panels

Motivation: There is growing recognition that estimating haplotypes from high coverage sequencing of single samples in clinical settings is an important problem. At the same time very large datasets consisting of tens and hundreds of thousands of high-coverage sequenced samples will soon be available. We describe a method that takes advantage of these huge human genetic variation resources and r...

Journal:
  • American journal of human genetics

Volume 91, Issue 2

Pages: -

Publication date: 2012